OTS/9000 Performance Information
--------------------------------

************************************************************************
* DISCLAIMER :                                                         *
* This document is intended only for general information purposes.    *
* HP specifically disclaims any and all warranties, expressed or      *
* implied, including but not limited to the warranties of             *
* merchantability or fitness for a particular purpose with respect to *
* the information in this document.                                   *
* The information herein is subject to change without notice.         *
************************************************************************

The following information is intended as a guide to performance tuning
and configuration for the C.03.00 OTS stack.

This section contains the following parts:

   A. Factors That Influence OTS Performance
         Terminology and Concepts
         Performance Factors
         Use of Performance Numbers
   B. OTS Performance Test Model
   C. OTS Performance
         Throughput Rate
         Request-Reply Performance
         Performance Tuning


A. Factors That Influence OTS Performance

There are many factors that influence the overall performance of a
network application.  The purpose of this section is to define some
common terms and to describe (loosely) their interrelationship with
respect to performance.

A-1) Terminology and Concepts

o  A TSDU (Transport Service Data Unit) is the logical unit of
   information transmitted between peer transport service, or XTI,
   users.  A TPDU (Transport Protocol Data Unit) is the physical unit
   of data transmitted between peer transport entities across the
   network.  One or more TPDUs are used to transmit one TSDU.  The
   "window" size determines the number of TPDUs that the transport
   entity can transmit before waiting for an acknowledgment from the
   peer entity.

   A TIDU (Transport Interface Data Unit) is the physical unit of data
   passed internally between the application and the transport entity
   (OTS) on the same system.  Typically, one or more TIDUs are used to
   pass one TSDU.  For the XTI send (t_snd) and receive (t_rcv)
   functions, the T_MORE flag is used to indicate whether or not the
   TSDU segment is the last part of the complete TSDU.

o  An SSDU (Session Service Data Unit) is the logical unit of
   information transmitted between peer session service users.  The
   session entities transmit information in the form of SPDUs (Session
   Protocol Data Units) using the transport service.  A session entity
   may transmit one or more SPDUs in a single TSDU.  The SSDU size for
   session version 2 normal data transfer is unlimited.  OTS segments
   an SSDU of normal data into SPDUs based on the underlying TPDU size.

o  A PSDU (Presentation Service Data Unit) is the logical unit of
   information transmitted between peer presentation service, or APLI,
   users.  The presentation entities transmit information in the form
   of PPDUs (Presentation Protocol Data Units) using the session
   service.  A presentation entity may transmit one or more PPDUs in a
   single SSDU.

NOTE: We have simplified the conceptual model by stopping at the
Transport layer.  Actually, each TPDU is segmented into one or more
Network layer packets; and for X.25, each Network packet is segmented
into one or more Data Link layer frames.  The Data Link frame is the
physical unit of data transmitted between systems.

In OTS, applications run in user space, and the protocol entities are
in the kernel, under Streams.  When an application issues an open(2)
system call to open a Streams device, a pair of Streams queues is
created in the kernel (one for sending, and one for receiving).
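As a concrete illustration of the T_MORE convention described above,
the following sketch (not part of the original text) shows how an XTI
sender might pass one TSDU to the stack as several smaller pieces.  It
assumes a connected endpoint in blocking mode; the segment size
(seg_len) is simply a caller-chosen value no larger than the TIDU, and
the function name send_tsdu() is hypothetical.

    #include <xti.h>

    /*
     * Send one TSDU of "len" bytes in pieces of at most "seg_len" bytes.
     * T_MORE is set on every piece except the last, so the receiver sees
     * a single logical TSDU.  Blocking mode is assumed, so each t_snd()
     * call accepts the full amount requested.
     */
    int
    send_tsdu(int fd, char *buf, unsigned int len, unsigned int seg_len)
    {
        while (len > 0) {
            unsigned int chunk = (len > seg_len) ? seg_len : len;
            int flags = (chunk < len) ? T_MORE : 0;  /* more pieces follow? */

            if (t_snd(fd, buf, chunk, flags) < 0)
                return -1;                  /* t_errno gives the reason */
            buf += chunk;
            len -= chunk;
        }
        return 0;
    }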
The XTI, Session, and APLI libraries use the Streams system calls
putmsg(2) and getmsg(2) to communicate using TIDUs, SIDUs, and PIDUs,
respectively, with the OTS protocol stack.

The maximum TPDU and window sizes are parameters of the OTS stack.  The
default maximum TPDU and window sizes are user-configurable, but these
can be negotiated downward during connection establishment with a peer
Transport entity.  The TSDU size is controlled by the XTI application.
If the TSDU size exceeds the maximum TIDU size, the XTI library will
segment the TSDU into multiple TIDUs.

In this section, we will assume that one TIDU is used for each TSDU for
an XTI application.  That is, the performance tests ensure that the
TSDU is not so large that it must be segmented into multiple TIDUs.
Consequently, in discussing the interactions between an application and
the OTS stack, we will speak only of TSDUs, even though TIDU might be a
more appropriate term.

When discussing the performance of the XTI, Session, or APLI
application below, we will often use the term SDU to refer to a TSDU,
SSDU, or PSDU, respectively.  SDU (Service Data Unit) is meant to be a
generic term for the unit of data seen by an application.

A-2) Performance Factors

Larger TSDUs require fewer transfers between the application and the
OTS stack.  This results in fewer context switches and lower system
overhead.  For example, if an application is sending 100K bytes in a
file transfer, this requires 100 transfers if the TSDU size is set to
1024, but only 25 transfers if the TSDU size is 4096.

Similarly, larger TPDUs require fewer transfers between peer transport
entities across the network.  Usually this results in better
throughput, due to decreased bandwidth used for protocol headers and
acknowledgements.  For example, if the OTS stack transmits a 2000 byte
TSDU, this requires 17 transfers if the TPDU size is set to 128, but
only 1 transfer if the TPDU size is 2048.

Larger window sizes allow the sending system to transmit more TPDUs
before waiting for the receiving system to acknowledge their arrival.
This increases the concurrent operation of the two systems.

Thus, we might expect that large TSDUs, large TPDUs, and large windows
would maximize throughput.  However, extremes in these directions can
have a negative impact on performance because of other system and
network factors.  For example, if we use a large window size and the
receiving system is much slower than the sender, the sender is likely
to overrun the LAN interface on the receiving system, and TPDUs will be
dropped (lost).  This will result in Transport retransmissions.  Also,
a lot of kernel memory may be consumed in the systems.  When multiplied
over a large number of connections, this can result in overall system
performance degradation.

Retransmissions can have a significant impact on effective performance
for two reasons.  First, because the timer resolution is one second,
even one retransmission on a 10 Mbit/sec LAN can devastate throughput,
even for transfers of a megabyte of data.  Second, the sender must
retransmit all of the unacknowledged TPDUs; typically, this is an
entire window.  For larger window and TPDU sizes this means
retransmitting more data.

Large TPDU sizes, whether you are using XTI, the Session Layer Access,
or APLI, also cause the stack to take up more memory for PDUs in
internal protocol and Streams queues.  This can degrade the performance
of the entire system if there is a large number of OTS connections
sending and/or receiving data.

Consequently, a delicate balance must be found among these parameters
in order to maximize performance.  Unfortunately, finding this balance
is not an exact science.  Experimentation with each situation is
required.
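On the receive side the same T_MORE convention applies: the SDU size an
application asks for in each receive call is the "SDU size" referred to
in the measurements below.  The following sketch (again illustrative
only, with a hypothetical helper name) reassembles one complete TSDU
with t_rcv(), assuming a connected, blocking endpoint and a buffer
large enough to hold the whole TSDU.

    #include <xti.h>

    /*
     * Receive one complete TSDU into "buf".  XTI keeps T_MORE set in
     * "flags" while further segments of the same TSDU remain, so we
     * keep reading until the flag is clear or the buffer is full.
     */
    int
    recv_tsdu(int fd, char *buf, unsigned int buflen)
    {
        unsigned int total = 0;
        int flags = 0;

        do {
            int got = t_rcv(fd, buf + total, buflen - total, &flags);
            if (got < 0)
                return -1;                  /* t_errno gives the reason */
            total += got;
        } while ((flags & T_MORE) && total < buflen);

        return (int)total;                  /* bytes in the TSDU */
    }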
B. OTS Performance Test Model

The following important points should be noted about the performance
measurements below:

- Measurements made on a Series 842 were done while it was
  communicating with an 827.  Therefore, whether the 842 was the sender
  or the receiver, it was bottlenecked by its own speed (and the link,
  of course).  Similarly, measurements of the 720 and 827 were done
  against the Series 750, except where noted.

- Except where noted, all the stack parameters were set to the default
  values.

- Only unidirectional data transfer was used while measuring
  throughput.

- All X.25 measurements were made using an Amnet switch with the line
  speed set to 64 Kbps.

- All the LAN measurements were made on an isolated LAN segment.

- Superfluous processes running on the system were killed during the
  measurements (notably, the NFS daemons, syncer, cron, sendmail,
  etc.).

- In all the measurements over the LAN, the throughput rate was best
  when the SDU size used was a little under 4096.  The lone exception
  was the XTI receiver, which at an SDU size of 3840 showed a
  throughput of 704 Kbytes/sec, utilizing 49.5 % of an 842.

- In all the measurements over X.25, the throughput rate was best when
  the SDU size used was 2304 bytes (for the 842), or when the SDU size
  was a little under 4096 (for the 827 and 750).

- All measurements were made only after the connection was established
  (to factor out connection establishment costs).

- All request-reply measurements were made with the requester sending
  'n' bytes as a request, and the reply being a fixed 100 bytes in
  size.  Data is given for different values of the request size.

Performance of a customer's network will depend upon the stack
parameters, the number of systems on the LAN (LAN traffic), the number
of subnets configured, the amount of physical memory available, and the
SDU sizes used.  For example, if a large number of connections are in
operation, all of them pumping a lot of data into the network, some
PDUs may be dropped due to buffer shortages, and this can result in
retransmissions, thereby affecting performance.  Even if a lot of other
tasks (unrelated to OTS) are being run on the system, the performance
of OTS can degrade considerably because of contention for CPU and
memory between OTS and the other tasks.

Using SDU sizes higher than 4096 bytes may result in slightly higher
performance, but the CPU utilization figures may increase considerably.
However, using SDU sizes higher than 4096 in combination with a large
number of connections can cause (depending on how much data is handled
by each connection) performance degradation of the entire system, as it
takes up a lot of kernel memory.

Measured performance can also be affected by the total amount of data
transferred during the lifetime of a connection.  This is because there
is a fixed overhead in connection establishment, the congestion
avoidance algorithms used by Transport, delayed acknowledgement, etc.
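As noted above, throughput was measured with unidirectional data
transfer over an established connection.  The sketch below is not the
actual test program; the function name and structure are illustrative
only, but it shows the general form of such a measurement from an XTI
sender: a timed loop of t_snd() calls, with the rate reported in
Kbytes/sec.

    #include <xti.h>
    #include <sys/time.h>

    /*
     * Send "count" SDUs of "sdu_len" bytes each on an already-connected
     * endpoint and return the observed rate in Kbytes/sec.  The SDU is
     * kept small enough that no T_MORE segmentation is needed.
     */
    double
    measure_send_rate(int fd, char *buf, unsigned int sdu_len, long count)
    {
        struct timeval start, end;
        double secs;
        long i;

        gettimeofday(&start, (struct timezone *)0);
        for (i = 0; i < count; i++)
            if (t_snd(fd, buf, sdu_len, 0) < 0)
                return -1.0;                /* t_errno gives the reason */
        gettimeofday(&end, (struct timezone *)0);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1000000.0;
        return ((double)count * sdu_len / 1024.0) / secs;
    }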
C. OTS Performance

C-1) Throughput Rate

Figure 4 gives the peak observed throughput using the XTI, Session, and
APLI programmatic access methods over 802.3.  Figure 5 gives the
throughput over X.25.  All measurements were made on an 842
communicating with an 827.  CPU utilization figures are those of the
842.

          -----------------------------------------------------------
          |           SENDER           |          RECEIVER          |
          |----------------------------|----------------------------|
          | Thruput  | CPU Util | SDU  | Thruput  | CPU Util | SDU  |
          | (KB/sec) |          |      | (KB/sec) |          |      |
          |----------|----------|------|----------|----------|------|
  XTI     |  809.0   |  45.3 %  | 3840 |  714.3   |  52.2 %  | 2816 |
          |----------|----------|------|----------|----------|------|
  Session |  768.6   |  59.9 %  | 3840 |  641.59  |  63.9 %  | 3840 |
          |----------|----------|------|----------|----------|------|
  APLI    |  807.0   |  51.9 %  | 3840 |  691.3   |  58.2 %  | 3840 |
          -----------------------------------------------------------

          Figure 4: Peak throughput measurements over 802.3 LAN.

          -----------------------------------------------------------
          |           SENDER           |          RECEIVER          |
          |----------------------------|----------------------------|
          | Thruput  | CPU Util | SDU  | Thruput  | CPU Util | SDU  |
          | (KB/sec) |          |      | (KB/sec) |          |      |
          |----------|----------|------|----------|----------|------|
  XTI     |   5.17   |  0.54 %  | 2304 |   5.11   |  1.32 %  | 2304 |
          |----------|----------|------|----------|----------|------|
  Session |   5.18   |  0.80 %  | 2304 |   5.12   |  1.65 %  | 2304 |
          |----------|----------|------|----------|----------|------|
  APLI    |   5.18   |  0.62 %  | 2304 |   5.05   |  1.65 %  | 2304 |
          -----------------------------------------------------------

          Figure 5: Peak throughput measurements over X.25 (1 connection).

C-2) Request-Reply Performance

Figures 6 and 7 show the request-reply performance of an 842
communicating with an 827 over 802.3 and X.25.  Transactions per second
and CPU utilization are shown for two SDU sizes, a "small" 100 byte SDU
and a "large" 3840 byte SDU.

          -----------------------------------------------------------
          |         REQUESTER          |          REPLIER           |
          |----------------------------|----------------------------|
          |  Trans   | CPU Util | Req  |  Trans   | CPU Util | Req  |
          | per sec  |          | Size | per sec  |          | Size |
          |----------|----------|------|----------|----------|------|
  XTI     |   189    |  33.4 %  |  100 |   190    |  35.3 %  |  100 |
          |    92    |  27.6 %  | 3840 |    91    |  30.8 %  | 3840 |
          |----------|----------|------|----------|----------|------|
  Session |   124    |  42.7 %  |  100 |   125    |  42.4 %  |  100 |
          |    73    |  34.0 %  | 3840 |    72    |  37.1 %  | 3840 |
          |----------|----------|------|----------|----------|------|
  APLI    |   149    |  37.2 %  |  100 |   149    |  37.2 %  |  100 |
          |    81    |  29.9 %  | 3840 |    79    |  33.5 %  | 3840 |
          -----------------------------------------------------------

          Figure 6: Request-reply measurements over 802.3 LAN.

          -----------------------------------------------------------
          |         REQUESTER          |          REPLIER           |
          |----------------------------|----------------------------|
          |  Trans   | CPU Util | Req  |  Trans   | CPU Util | Req  |
          | per sec  |          | Size | per sec  |          | Size |
          |----------|----------|------|----------|----------|------|
  XTI     |   7.6    |  1.64 %  |  100 |   8.2    |  1.82 %  |  100 |
          |   1.2    |  0.45 %  | 3840 |   1.1    |  1.33 %  | 3840 |
          |----------|----------|------|----------|----------|------|
  Session |   7.6    |  3.11 %  |  100 |   7.6    |  3.11 %  |  100 |
          |   1.2    |  0.71 %  | 3840 |   1.2    |  1.62 %  | 3840 |
          |----------|----------|------|----------|----------|------|
  APLI    |   7.6    |  2.32 %  |  100 |   7.6    |  2.35 %  |  100 |
          |   1.2    |  0.58 %  | 3840 |   1.1    |  1.60 %  | 3840 |
          -----------------------------------------------------------

          Figure 7: Request-reply measurements over X.25 (1 connection).
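In these tests, the requester side of each transaction amounts to one
send of the n-byte request followed by one receive of the fixed
100-byte reply.  A minimal sketch follows (illustrative only; the
function name is hypothetical, the endpoint is assumed connected and
blocking, and T_MORE reassembly is omitted since both SDUs fit in a
single piece):

    #include <xti.h>

    #define REPLY_LEN 100     /* replies are a fixed 100 bytes */

    /* Perform one request-reply transaction from the requester side. */
    int
    do_transaction(int fd, char *req, unsigned int req_len, char *reply)
    {
        int flags = 0;

        if (t_snd(fd, req, req_len, 0) < 0)
            return -1;                      /* t_errno gives the reason */
        if (t_rcv(fd, reply, REPLY_LEN, &flags) < 0)
            return -1;
        return 0;
    }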
C-3) Performance Tuning

The following data provides guidance in configuring OTS for optimal
performance.  OTS throughput and request-response performance are
compared for OTS running on different processors and network
interfaces.  The effect of the TP4 checksum calculation on performance
is measured, and throughput measurements are taken for varying TPDU and
window sizes.

Figure 8 shows the relative performance of the 842, 827, 720, and 750
over 802.3 using XTI.  Measurements were taken with an 842 sending data
to an 827, the 827 sending to a 750, and the 720 and 750 sending to
each other.  In general, the relative performance of the other APIs to
XTI should be similar to their relative performance as measured on the
842.

The performance of machines other than those shown here can best be
estimated by using data for a machine with a similar processor and I/O
architecture.  For instance, comparisons can be made between the 827
and other members of the 8x7 family of machines, or between the 720 and
750, by scaling the CPU utilization by processor speed.

          -----------------------------------------------------------
          |           SENDER           |          RECEIVER          |
          |----------------------------|----------------------------|
          | Thruput  | CPU Util | SDU  | Thruput  | CPU Util | SDU  |
          | (KB/sec) |          |      | (KB/sec) |          |      |
          |----------------------------|----------------------------|
  842     |  809.0     45.3 %    3840  |  704.9     49.6 %    3840  |
  827     | 1077.7     47.6 %    3840  | 1113.6     72.0 %    3840  |
  720     | 1127.6     37.8 %    3840  | 1140.4     50.9 %    3840  |
  750     | 1140.4     29.4 %    3840  | 1127.6     36.0 %    3840  |
          -----------------------------------------------------------

          Figure 8: Throughput for systems over XTI/802.3.

          -----------------------------------------------------------
          |           SENDER           |          RECEIVER          |
          |----------------------------|----------------------------|
          | Thruput  | CPU Util | SDU  | Thruput  | CPU Util | SDU  |
          | (KB/sec) |          |      | (KB/sec) |          |      |
          |----------------------------|----------------------------|
  842     |   5.08     0.35 %    3840  |   5.04     1.27 %    3840  |
  827     |   5.19     0.39 %    3840  |   5.36     1.00 %    3840  |
  750     |   5.36     0.52 %    3840  |   5.19     0.69 %    3840  |
          -----------------------------------------------------------

          Figure 9: Throughput for systems over XTI/X.25 (1 connection).

Figures 10 and 11 show a comparison of different systems for
request-reply traffic.  As before, the machines were paired as 842-827,
827-750, 720-750, and 750-720.  Request-reply traffic is more dependent
than throughput on the speed of the partner machine; therefore, the raw
number of transactions per second is not as important as the amount of
CPU power needed to process each transaction.  (The CPU cost per
transaction can be estimated by dividing the CPU utilization by the
transaction rate.)
          -----------------------------------------------------------
          |         REQUESTER          |          REPLIER           |
          |----------------------------|----------------------------|
          |  Trans   | CPU Util | Req  |  Trans   | CPU Util | Req  |
          | per sec  |          | Size | per sec  |          | Size |
          |----------------------------|----------------------------|
  842     |   189      33.4 %     100  |   190      35.3 %     100  |
          |    92      27.6 %    3840  |    91      30.8 %    3840  |
          |----------------------------|----------------------------|
  827     |   274      45.7 %     100  |   265      45.2 %     100  |
          |   131      20.4 %    3840  |   129      38.6 %    3840  |
          |----------------------------|----------------------------|
  720     |   489      58.8 %     100  |   492      59.8 %     100  |
          |   176      30.8 %    3840  |   177      37.8 %    3840  |
          |----------------------------|----------------------------|
  750     |   492      41.6 %     100  |   489      43.1 %     100  |
          |   177      22.9 %    3840  |   176      27.1 %    3840  |
          -----------------------------------------------------------

          Figure 10: Request-reply measurements for systems over XTI/802.3.

          -----------------------------------------------------------
          |         REQUESTER          |          REPLIER           |
          |----------------------------|----------------------------|
          |  Trans   | CPU Util | Req  |  Trans   | CPU Util | Req  |
          | per sec  |          | Size | per sec  |          | Size |
          |----------------------------|----------------------------|
  842     |   7.6      1.64 %     100  |   8.2      1.82 %     100  |
          |   1.2      0.45 %    3840  |   1.1      1.33 %    3840  |
          |----------------------------|----------------------------|
  827     |   8.1      2.05 %     100  |   8.2      2.08 %     100  |
          |   1.2      0.50 %    3840  |   1.2      1.01 %    3840  |
          |----------------------------|----------------------------|
  750     |   8.2      1.41 %     100  |   8.1      1.41 %     100  |
          |   1.2      0.58 %    3840  |   1.2      0.72 %    3840  |
          -----------------------------------------------------------

          Figure 11: Request-reply measurements for systems over XTI/X.25
                     (1 connection).

The effect of using the optional TP4 checksum is measured in Figure 12,
tested between two 827 machines running an XTI application.  The
checksum calculation consumes an additional 31 % (receiver) to 53 %
(sender) of CPU, while delivering about 11 % less throughput.

                   ------------------------------------------
                   | Thruput  | CPU Util | CPU Util |  SDU  |
                   | (KB/sec) |  Sender  | Receiver |       |
                   |----------|----------|----------|-------|
  with checksum    |  821.4      64.0 %     79.6 %    3840  |
  without checksum |  927.0      41.8 %     60.6 %    3840  |
                   ------------------------------------------

                   Figure 12: Effect of the TP4 checksum on XTI
                              throughput over 802.3.

Varying the TPDU and/or transport window size is another way of trying
to tune performance.  Setting a larger TPDU size will allow OTS to
operate more efficiently, handling larger chunks of data at a time.
However, this may be offset by the increased penalty for
retransmissions, depending on the reliability of and congestion on the
LAN.  Also, a larger TPDU or window size increases the number of kernel
buffers that are needed to store data that is pending acknowledgement
or is waiting for delivery to the application.

The data below shows the results of varying the TPDU size using the
default window size, and then varying the window size with the default
TPDU size.  All measurements were taken over XTI between two 827s.  No
measurements are shown using an 8192 byte TPDU because, in the Streams
environment, XTI TSDUs of greater than 4 KB will be fragmented into
smaller TIDUs.  The performance for a TPDU size of 8192 bytes would be
similar to that shown for 4096.
    -----------------------------------------------------------
    | TPDU | window | Thruput  | CPU Util | CPU Util |  SDU   |
    |      |  size  | (KB/sec) |  Sender  | Receiver |        |
    |---------------------------------------------------------|
    |  512     3       421.4      59.9 %     76.5 %     3960  |
    | 1024     3       598.4      61.0 %     69.9 %     3960  |
    | 2048     3       601.5      57.1 %     69.6 %     3960  |
    | 4096     3       828.9      63.1 %     78.8 %     3960  |
    |---------------------------------------------------------|
    | 4096     3       828.9      63.1 %     78.8 %     3960  |
    | 4096     6       972.0      70.6 %     89.3 %     3960  |
    | 4096    10       987.7      68.9 %     89.2 %     3960  |
    | 4096    12       991.5      68.9 %     89.3 %     3960  |
    | 4096    14       988.1      68.0 %     88.7 %     3960  |
    -----------------------------------------------------------

    Figure 13: Effect of varying TPDU and window size on XTI
               throughput over 802.3.

Notice that performance increases significantly with increasing TPDU
sizes.  The best throughput occurs with a window size of 12, although
the differences among the larger window sizes are not dramatic.  Again,
these measurements were taken under ideal conditions over an isolated
LAN.

Figure 14 shows the same measurements taken over X.25, varying the TPDU
and window sizes around the CONS defaults.  The best throughput comes
with a TPDU size of 256 bytes, although this comes at the cost of some
additional CPU cycles.  Also, note that the performance improves only
slightly with increasing window size.  This confirms our expectation
that X.25 performance is limited by the speed of the X.25 interface and
the X.25 subnetwork, and not by processing or protocol delays in the
OSI stack.

    -----------------------------------------------------------
    | TPDU | window | Thruput  | CPU Util | CPU Util |  SDU   |
    |      |  size  | (KB/sec) |  Sender  | Receiver |        |
    |---------------------------------------------------------|
    |  128     5        4.98      2.12 %     3.75 %     3840  |
    |  256     5        5.54      1.38 %     2.53 %     3840  |
    |  512     5        5.15      0.78 %     1.47 %     3840  |
    | 1024     5        5.10      0.51 %     1.25 %     3840  |
    | 2048     5        5.05      0.35 %     1.23 %     3840  |
    |---------------------------------------------------------|
    | 2048     2        5.04      0.35 %     1.17 %     3840  |
    | 2048     3        5.05      0.35 %     1.21 %     3840  |
    | 2048     5        5.05      0.35 %     1.23 %     3840  |
    | 2048     9        5.05      0.35 %     1.26 %     3840  |
    | 2048    15        5.06      0.36 %     1.30 %     3840  |
    -----------------------------------------------------------

    Figure 14: Effect of varying TPDU and window size on XTI
               throughput over X.25 (1 connection).

.......................................................................

Item Subject: OTS PICS (April, 1992)

Below is a list of all the available transport and session PICS for
OPUSk.  Currently there is no automated way to retrieve these, since we
are limited to paper copies.  Please send requests for OTS PICS to
Jean-Yves RIGAULT at ENMC Grenoble.

PICS completed to date:

Session:
    COS session version 1.0 (MHS)
    COS session version 2.0 (FTAM)
    CTS-WAN (OSTC) PICS
    CTS-WAN (OSTC) session version 1.0 (MHS)
    INTAP (MHS) PICS
    INTAP (FTAM) PICS

Transport/CLNP:
    COS: TP4, CLNP (TA51, TA52, TA1111 profiles)
    COS: TP0 (TD1111 profile)
    US GOSIP 1.0: TP0, TP4, CLNP PICS
    CTS-WAN PICS: TP0, 1, 2, 3, 4
    CTS-WAN (MHS) PICS
    INTAP PICS: TP0, 2
    UK GOSIP PICS: TP0, 2, 4, CLNP, ES-IS
    Australian GOSIP PICS: TP0, 2, 4, CLNP

.......................................................................